1.
PLoS Comput Biol ; 19(5): e1011162, 2023 05.
Article in English | MEDLINE | ID: mdl-37220151

ABSTRACT

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.


Subjects
Biological Products, Bacterial Genome, Metagenome, Multigene Family/genetics, Biological Products/metabolism, Supervised Machine Learning
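
The abstract of record 1 describes representing a BGC as a chain of protein domains and training a masked language model on those domains. The data-preparation step for that kind of training can be sketched as follows. This is only an illustration: the domain tokens, the `[MASK]` symbol, the masking rate, and the function name `mask_domains` are all assumptions and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not from the paper


def mask_domains(domains, mask_rate=0.15, seed=0):
    """Hide a random subset of domain tokens in one cluster.

    Returns the masked sequence and a dict {position: original token}
    that a model would be trained to recover.
    """
    if not domains:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one position so every cluster yields a target.
    n_mask = max(1, round(mask_rate * len(domains)))
    positions = sorted(rng.sample(range(len(domains)), n_mask))
    masked = list(domains)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


# Hypothetical cluster written as made-up domain labels.
cluster = ["DOM_A", "DOM_B", "DOM_C", "DOM_D", "DOM_E", "DOM_F"]
masked, targets = mask_domains(cluster, mask_rate=0.3)
print(masked)
print(targets)
```

A model trained on such pairs sees only `masked` and is scored on how well it predicts the entries of `targets`, which is what makes the setup self-supervised: no product-class labels are needed at this stage.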
2.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36355460

ABSTRACT

MOTIVATION: Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. RESULTS: Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood. AVAILABILITY AND IMPLEMENTATION: Our code and examples are available at: https://github.com/spetti/SMURF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Proteins, Humans, Sequence Alignment, Proteins/chemistry, Computer Neural Networks, Amino Acid Sequence
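
The abstract of record 2 refers to a smooth, differentiable version of the Smith-Waterman algorithm. One common way to smooth a dynamic program is to replace each hard `max` with a temperature-scaled log-sum-exp, and the sketch below applies that idea to a local alignment score. It is a minimal illustration and not the SMURF code: it assumes a linear gap penalty, and the names `smooth_max` and `smooth_smith_waterman` are hypothetical.

```python
import math


def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp; approaches max(values) as temperature -> 0."""
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def smooth_smith_waterman(score, gap=1.0, temperature=1.0):
    """Smoothed local alignment score for a precomputed similarity matrix.

    score[i][j] is the similarity of position i in sequence A and
    position j in sequence B. A linear gap penalty is assumed.
    """
    n, m = len(score), len(score[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]  # the empty alignment
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = smooth_max(
                [
                    0.0,                                   # start a new local alignment
                    H[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                    H[i - 1][j] - gap,                      # gap in sequence B
                    H[i][j - 1] - gap,                      # gap in sequence A
                ],
                temperature,
            )
            cells.append(H[i][j])
    # Local alignment may end anywhere, so smooth over all cells.
    return smooth_max(cells, temperature)


similarity = [[2.0, -1.0], [-1.0, 2.0]]
print(smooth_smith_waterman(similarity, temperature=0.01))  # close to the hard score
print(smooth_smith_waterman(similarity, temperature=1.0))   # smoother, larger value
```

Because every operation is a sum, exponential, or logarithm, the output varies smoothly with the entries of `score`, which is the property that lets an alignment step sit inside a model trained by gradient descent.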
3.
Pac Symp Biocomput ; 27: 34-45, 2022.
Article in English | MEDLINE | ID: mdl-34890134

ABSTRACT

The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment. Increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and showing competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which, in a certain limit, recovers a Potts model, and use it to contrast Potts and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases not captured by single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.


Subjects
Computational Biology, Proteins, Attention, Humans, Proteins/genetics, Sequence Alignment
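
The abstract of record 3 mentions training a Potts model on a multiple sequence alignment to estimate coevolving positions. As a reminder of what that model scores, the sketch below evaluates a Potts energy for one aligned sequence and the Frobenius norm of one coupling block, a quantity commonly used as a raw contact score in this family of methods (usually followed by a correction step not shown here). The sign convention, the data layout, and the function names are assumptions made for illustration; fitting the parameters is not shown.

```python
import math


def potts_energy(seq, h, J):
    """Energy of one aligned sequence under a Potts model.

    seq: list of integer states, one per alignment column.
    h[i][a]: single-site term for state a at column i.
    J[i][j][a][b]: pairwise term for states (a, b) at columns i < j.
    Assumed convention: lower energy means higher probability.
    """
    length = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(length))
    for i in range(length):
        for j in range(i + 1, length):
            energy -= J[i][j][seq[i]][seq[j]]
    return energy


def coupling_strength(block):
    """Frobenius norm of one coupling block J[i][j]."""
    return math.sqrt(sum(v * v for row in block for v in row))


# Toy model: two columns, two states, couplings that favour equal states.
h = [[0.0, 0.0], [0.0, 0.0]]
J = [[None, [[1.0, 0.0], [0.0, 1.0]]], [None, None]]
print(potts_energy([0, 0], h, J))
print(potts_energy([0, 1], h, J))
print(coupling_strength(J[0][1]))
```

In the toy model, the sequence with matching states gets the lower energy, and a large coupling norm for a column pair is what would be read as evidence of coevolution between those positions.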
4.
Genome Res ; 31(2): 239-250, 2021 Feb.
Article in English | MEDLINE | ID: mdl-33361114

ABSTRACT

Biosynthetic gene clusters (BGCs) are operonic sets of microbial genes that synthesize specialized metabolites with diverse functions, including siderophores and antibiotics, which often require export to the extracellular environment. For this reason, genes for transport across cellular membranes are essential for the production of specialized metabolites and are often genomically colocalized with BGCs. Here, we conducted a comprehensive computational analysis of transporters associated with characterized BGCs. In addition to known exporters, in BGCs we found many importer-specific transmembrane domains that co-occur with substrate binding proteins possibly for uptake of siderophores or metabolic precursors. Machine learning models using transporter gene frequencies were predictive of known siderophore activity, molecular weights, and a measure of lipophilicity (log P) for corresponding BGC-synthesized metabolites. Transporter genes associated with BGCs were often equally or more predictive of metabolite features than biosynthetic genes. Given the importance of siderophores as pathogenicity factors, we used transporters specific for siderophore BGCs to identify both known and uncharacterized siderophore-like BGCs in genomes from metagenomes from the infant and adult gut microbiome. We find that 23% of microbial genomes from premature infant guts have siderophore-like BGCs, but only 3% of those assembled from adult gut microbiomes do. Although siderophore-like BGCs from the infant gut are predominantly associated with Enterobacteriaceae and Staphylococcus, siderophore-like BGCs can be identified from taxa in the adult gut microbiome that have rarely been recognized for siderophore production. Taken together, these results show that consideration of BGC-associated transporter genes can inform predictions of specialized metabolite structure and function.

5.
Sci Adv ; 5(12): eaax5727, 2019 12.
Article in English | MEDLINE | ID: mdl-31844663

ABSTRACT

Necrotizing enterocolitis (NEC) is a devastating intestinal disease that occurs primarily in premature infants. We performed genome-resolved metagenomic analysis of 1163 fecal samples from premature infants to identify microbial features predictive of NEC. Features considered include genes, bacterial strain types, eukaryotes, bacteriophages, plasmids, and growth rates. A machine learning classifier found that samples collected before NEC diagnosis harbored significantly more Klebsiella, bacteria encoding fimbriae, and bacteria encoding secondary metabolite gene clusters related to quorum sensing and bacteriocin production. Notably, replication rates of all bacteria, especially Enterobacteriaceae, were significantly higher 2 days before NEC diagnosis. The findings uncover biomarkers that could lead to early detection of NEC and targets for microbiome-based therapeutics.


Subjects
Necrotizing Enterocolitis/genetics, Bacterial Fimbriae/genetics, Gastrointestinal Microbiome/genetics, Metagenomics, Enterobacteriaceae/genetics, Necrotizing Enterocolitis/microbiology, Feces/microbiology, Bacterial Fimbriae/microbiology, Humans, Newborn Infant, Premature Infant, Premature Infant Diseases/genetics, Premature Infant Diseases/microbiology, Klebsiella/genetics, Multigene Family/genetics
6.
Nucleic Acids Res ; 47(8): 4198-4210, 2019 05 07.
Article in English | MEDLINE | ID: mdl-30805621

ABSTRACT

The ribosome exit tunnel is an important structure involved in the regulation of translation and other essential functions such as protein folding. By comparing 20 recently obtained cryo-EM and X-ray crystallography structures of the ribosome from all three domains of life, we here characterize the key similarities and differences of the tunnel across species. We first show that a hierarchical clustering of tunnel shapes closely reflects the species phylogeny. Then, by analyzing the ribosomal RNAs and proteins, we explain the observed geometric variations and show direct association between the conservations of the geometry, structure and sequence. We find that the tunnel is more conserved in the upper part close to the polypeptide transferase center, while in the lower part, it is substantially narrower in eukaryotes than in bacteria. Furthermore, we provide evidence for the existence of a second constriction site in eukaryotic exit tunnels. Overall, these results have several evolutionary and functional implications, which explain certain differences between eukaryotes and prokaryotes in their translation mechanisms. In particular, they suggest that major co-translational functions of bacterial tunnels were externalized in eukaryotes, while reducing the tunnel size provided some other advantages, such as facilitating the nascent chain elongation and enabling antibiotic resistance.


Subjects
Archaea/genetics, Bacteria/genetics, Eukaryota/genetics, Protein Biosynthesis, Ribosomal RNA/chemistry, Ribosomal Proteins/chemistry, Ribosomes/ultrastructure, Amino Acid Sequence, Archaea/classification, Archaea/metabolism, Bacteria/classification, Bacteria/metabolism, Cryoelectron Microscopy, X-Ray Crystallography, Eukaryota/classification, Eukaryota/metabolism, Nucleic Acid Conformation, Phylogeny, Protein Folding, Protein Secondary Structure, Ribosomal RNA/genetics, Ribosomal RNA/metabolism, Ribosomal Proteins/genetics, Ribosomal Proteins/metabolism, Ribosomes/classification, Ribosomes/genetics, Ribosomes/metabolism, Sequence Alignment, Amino Acid Sequence Homology
7.
Adv Neural Inf Process Syst ; 32: 9689-9701, 2019 Dec.
Article in English | MEDLINE | ID: mdl-33390682

ABSTRACT

Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
